[GCM] Add backfill and main scheduling sdiag stats to slurm monitor #39
Conversation
Reviewed snippet:

    # Reset sdiag counters after collection
    self._reset_sdiag_counters()
Why are we resetting the sdiag counters on each collection?
Note that these counters are cumulative if we don't call sdiag reset. I'm adding a reset call every time after we collect sdiag stats, which will make the data more meaningful on a per-interval basis.
@yonglimeta maybe add a CLI flag to control this; resetting seems fine.
This is because these sdiag counters are cumulative; if not reset, they keep increasing. Resetting allows us to collect true time-series sdiag data.
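For context, here is a minimal sketch of the collect-then-reset flow being discussed, including the optional flag suggested above. The class name, flag name, and method split are assumptions for illustration, not the actual GCM collector code.

    import subprocess

    class SlurmSdiagCollector:
        def __init__(self, reset_sdiag: bool = True):
            # Hypothetical switch; a real CLI flag could map onto this.
            self.reset_sdiag = reset_sdiag

        def _collect_sdiag_stats(self) -> str:
            # `sdiag` prints cumulative scheduler and backfill counters.
            result = subprocess.run(
                ["sdiag"], capture_output=True, text=True, check=True
            )
            return result.stdout

        def _reset_sdiag_counters(self) -> None:
            # `sdiag --reset` zeroes the counters so the next collection
            # covers only the interval since this call.
            subprocess.run(["sdiag", "--reset"], check=True)

        def collect(self) -> str:
            stats = self._collect_sdiag_stats()
            if self.reset_sdiag:
                # Reset sdiag counters after collection
                self._reset_sdiag_counters()
            return stats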
I'm okay to start with this.
I hope these race conditions are either covered or turn out not to be a big issue:
1/ Data is collected, then sdiag is reset; any new counts that arrive in parallel between the two steps get reset as well and are lost.
2/ sdiag also resets every midnight, which opens up multiple race-condition scenarios of its own.
Seconding the concern about potential data loss between each collection and reset. Also, it seems that Slurm natively resets the counters at 12 AM server time.
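Purely as a sketch of one possible mitigation (not what this PR does): the collector could leave the counters cumulative and compute per-interval deltas itself, treating any decrease as an external reset such as the midnight reset or a slurmctld restart.

    class SdiagDeltaTracker:
        """Turn cumulative sdiag counters into per-interval deltas."""

        def __init__(self) -> None:
            self._last: dict[str, int] = {}

        def delta(self, current: dict[str, int]) -> dict[str, int]:
            deltas = {}
            for name, value in current.items():
                prev = self._last.get(name, 0)
                # A decrease means the counter was reset externally
                # (midnight reset, slurmctld restart); count from zero.
                deltas[name] = value - prev if value >= prev else value
            self._last = dict(current)
            return deltas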
After merging there are failed checks. The nox-format and typecheck failures seem to point to the sprio update and should not be related to this PR.
Summary
We would like to add the backfill and main scheduling sdiag stats listed below to the GCM collector:
These will help us debug Slurm controller slowness and responsiveness issues.
Note that these counters are cumulative if we don't call sdiag reset, so we add a reset call every time after collecting sdiag stats. This makes the data more meaningful on a per-interval basis.
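For reference, a rough sketch of how the backfill fields could be parsed from `sdiag` text output. The label strings and regexes below are assumptions based on typical sdiag output; only the metric names are taken from the sample output in the test plan.

    import re
    import subprocess

    # Assumed mapping from sdiag output labels to collector metric names.
    BACKFILL_PATTERNS = {
        "bf_backfilled_jobs": r"Total backfilled jobs \(since last stats cycle start\):\s+(\d+)",
        "bf_cycle_max": r"Max cycle:\s+(\d+)",
        "bf_cycle_mean": r"Mean cycle:\s+(\d+)",
        "bf_queue_len": r"Last queue length:\s+(\d+)",
    }

    def parse_backfill_stats(sdiag_output: str) -> dict[str, int]:
        # Restrict matching to the "Backfilling stats" section so the main
        # scheduler's "Max cycle"/"Mean cycle" lines are not picked up.
        _, _, backfill_section = sdiag_output.partition("Backfilling stats")
        stats = {}
        for name, pattern in BACKFILL_PATTERNS.items():
            match = re.search(pattern, backfill_section)
            if match:
                stats[name] = int(match.group(1))
        return stats

    if __name__ == "__main__":
        output = subprocess.run(
            ["sdiag"], capture_output=True, text=True, check=True
        ).stdout
        print(parse_backfill_stats(output))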
Test Plan
Run the unit tests:

    python -m pytest gcm/tests/test_slurm.py -v

Output:
Also build and install gcm on the fair-rc cluster (within a gcm conda env):
Output:
[{"derived_cluster": "fair-aws-rc-1", "server_thread_count": 1, "agent_queue_size": 0, "agent_count": 0, "agent_thread_count": 0, "dbd_agent_queue_size": 0, "schedule_cycle_max": 30402, "schedule_cycle_mean": 1300, "schedule_cycle_sum": 3151388, "schedule_cycle_total": 2424, "schedule_cycle_per_minute": 1, "schedule_queue_length": 63, "sdiag_jobs_submitted": 769, "sdiag_jobs_started": 691, "sdiag_jobs_completed": 681, "sdiag_jobs_canceled": 1, "sdiag_jobs_failed": 0, "sdiag_jobs_pending": 78, "sdiag_jobs_running": 1, "bf_backfilled_jobs": 46, "bf_cycle_mean": 5846, "bf_cycle_sum": 3852599, "bf_cycle_max": 24428, "bf_queue_len": 62, "nodes_allocated": 0, "nodes_completing": 2, "nodes_down": 0, "nodes_drained": 0, "nodes_draining": 0, "nodes_fail": 0, "nodes_failing": 0, "nodes_future": 0, "nodes_idle": 2, "nodes_inval": 0, "nodes_maint": 0, "nodes_reboot_issued": 0, "nodes_reboot_requested": 0, "nodes_mixed": 0, "nodes_perfctrs": 0, "nodes_planned": 0, "nodes_power_down": 0, "nodes_powered_down": 0, "nodes_powering_down": 0, "nodes_powering_up": 0, "nodes_reserved": 0, "nodes_unknown": 0, "nodes_not_responding": 0, "nodes_unknown_state": 0, "nodes_total": 4, "total_cpus_avail": 448, "total_gpus_avail": 16, "total_cpus_up": 448, "total_gpus_up": 16, "total_cpus_down": 0, "total_gpus_down": 0, "cluster": "fair-aws-rc-1", "running_and_pending_users": 0, "jobs_pending": 0, "gpus_pending": 0, "nodes_pending": 0, "jobs_failed": 0, "jobs_running": 0, "jobs_without_user": 0, "total_cpus_alloc": 0, "total_down_nodes": 0, "total_gpus_alloc": 0, "total_nodes_alloc": 2}]